[Bar chart: AP of 4-bit and 3-bit DETR-R50 as modules are progressively quantized: (1) backbone, (2) encoder, (3) MHA of decoder, (4) MLPs. From the 83.3 AP real-valued baseline, the 4-bit model obtains 82.2, 81.1, 79.3, and 78.8 AP for (1), (1)+(2), (1)+(2)+(3), and (1)+(2)+(3)+(4), respectively; the 3-bit model obtains 80.1, 79.3, 77.2, and 76.8 AP. The largest per-step drops (-1.8 AP at 4 bits, -2.1 AP at 3 bits) occur when the decoder MHA is quantized.]
FIGURE 2.11
Performance of 3/4-bit quantized DETR-R50 on VOC with different quantized modules.
$2^{a-1}-1$, $Q^{w}_{n} = -2^{b-1}$, $Q^{w}_{p} = 2^{b-1}-1$ are the discrete bounds for $a$-bit activations and
$b$-bit weights. In this paper, $x$ generally denotes an activation, including the input feature
maps of convolution and fully-connected layers and the inputs of multi-head attention modules.
Based on this, we first give the quantized fully-connected layer as:
$$
\text{Q-FC}(x) = Q_a(x) \cdot Q_w(w) = \alpha_x \alpha_w \circ \left( x_q \odot w_q + \frac{z}{\alpha_x} \circ w_q \right),
\tag{2.25}
$$
where $\cdot$ denotes matrix multiplication and $\odot$ denotes matrix multiplication implemented with
efficient bit-wise operations. The straight-through estimator (STE) [9] is used to propagate
gradients through the non-differentiable rounding operation in backward propagation.
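To make the fake-quantized forward pass and the STE backward pass concrete, the following PyTorch sketch implements a uniform quantizer and a Q-FC layer in the spirit of Eq. (2.25). The learnable step sizes, the zero-point handling, and the names (ste_round, quantize, QFC) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of uniform quantization with STE and a quantized FC layer (Eq. 2.25).
import torch
import torch.nn as nn
import torch.nn.functional as F


def ste_round(x):
    # Round in the forward pass; pass the gradient straight through (STE).
    return (x.round() - x).detach() + x


def quantize(x, alpha, zero_point, num_bits, signed):
    # Uniform quantization: shift, scale, clip to the discrete bounds, round, dequantize.
    if signed:
        qn, qp = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qn, qp = 0, 2 ** num_bits - 1
    xq = ste_round(torch.clamp((x - zero_point) / alpha, qn, qp))
    return alpha * xq + zero_point  # fake-quantized value used in training


class QFC(nn.Module):
    """Fully-connected layer with b-bit weights and a-bit activations."""

    def __init__(self, in_features, out_features, a_bits=4, b_bits=4):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)
        self.a_bits, self.b_bits = a_bits, b_bits
        # Learnable step sizes and an activation zero-point (cf. z in Eq. 2.25).
        self.alpha_x = nn.Parameter(torch.tensor(1.0))
        self.alpha_w = nn.Parameter(torch.tensor(1.0))
        self.z_x = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        xq = quantize(x, self.alpha_x, self.z_x, self.a_bits, signed=True)
        wq = quantize(self.fc.weight, self.alpha_w, 0.0, self.b_bits, signed=True)
        return F.linear(xq, wq, self.fc.bias)


if __name__ == "__main__":
    layer = QFC(256, 256, a_bits=4, b_bits=4)
    out = layer(torch.randn(8, 100, 256))  # (batch, queries, dim)
    print(out.shape)
```

At deployment time the two scale factors can be folded outside the integer matrix multiplication, which is what allows the bit-wise implementation indicated by $\odot$ in Eq. (2.25).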
In DETR [31], the visual features generated by the backbone are augmented with position
embeddings and fed into the transformer encoder. Given the encoder output E, DETR
performs co-attention between the object queries O and the visual features E, formulated as:
$$
\begin{aligned}
q &= \text{Q-FC}(O), \qquad k, v = \text{Q-FC}(E), \\
A_i &= \operatorname{softmax}\!\left( Q_a(q)_i \cdot Q_a(k)_i^{\top} / \sqrt{d} \right), \\
D_i &= Q_a(A)_i \cdot Q_a(v)_i,
\end{aligned}
\tag{2.26}
$$
where D is the output of the multi-head co-attention module, i.e., the co-attended feature for the
object queries, and d denotes the feature dimension of each head. Additional FC layers then
transform the decoder's output feature of each object query into the final box and class
predictions. Given these predictions, the Hungarian algorithm [31] is applied between predictions
and ground-truth box annotations to identify the learning target of each object query.
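Under the same assumptions, a minimal sketch of the quantized co-attention in Eq. (2.26) could look as follows. It reuses the QFC layer and quantize() helper from the previous snippet; the head-splitting layout and the names (QCoAttention, q_proj, kv_proj, per-tensor step sizes) are illustrative choices rather than the authors' implementation.

```python
# Sketch of multi-head co-attention with quantized projections and matmuls (Eq. 2.26).
import torch
import torch.nn as nn


class QCoAttention(nn.Module):
    """Co-attention between object queries O and encoder memory E."""

    def __init__(self, dim=256, num_heads=8, a_bits=4, b_bits=4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.q_proj = QFC(dim, dim, a_bits, b_bits)       # q = Q-FC(O)
        self.kv_proj = QFC(dim, 2 * dim, a_bits, b_bits)  # k, v = Q-FC(E)
        self.a_bits = a_bits
        # Per-tensor step sizes for the activation quantizers on q, k, A, and v.
        self.alphas = nn.ParameterDict(
            {name: nn.Parameter(torch.tensor(1.0)) for name in ("q", "k", "A", "v")}
        )

    def _split(self, x):
        # (batch, tokens, dim) -> (batch, heads, tokens, head_dim)
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d).transpose(1, 2)

    def forward(self, O, E):
        q = self._split(self.q_proj(O))
        k, v = map(self._split, self.kv_proj(E).chunk(2, dim=-1))

        def fq(x, name, signed=True):
            return quantize(x, self.alphas[name], 0.0, self.a_bits, signed)

        # A_i = softmax(Q_a(q)_i Q_a(k)_i^T / sqrt(d)) for each head i.
        A = torch.softmax(fq(q, "q") @ fq(k, "k").transpose(-2, -1) / self.d ** 0.5, dim=-1)
        # D_i = Q_a(A)_i Q_a(v)_i; attention weights are non-negative, hence unsigned.
        D = fq(A, "A", signed=False) @ fq(v, "v")
        # Re-merge heads into (batch, queries, dim).
        return D.transpose(1, 2).reshape(O.shape)


if __name__ == "__main__":
    attn = QCoAttention(dim=256, num_heads=8)
    O = torch.randn(2, 100, 256)   # 100 object queries
    E = torch.randn(2, 1000, 256)  # flattened encoder features
    print(attn(O, E).shape)        # torch.Size([2, 100, 256])
```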
2.4.2 Challenge Analysis
Intuitively, the performance of the quantized DETR baseline largely depends on its information
representation capability, which is mainly reflected in the information carried by the multi-head
attention modules. Unfortunately, this information is severely degraded by the quantized
weights and inputs in the forward pass. Moreover, the rounded, discrete quantization
significantly affects optimization during backpropagation.
We conduct quantitative ablation experiments by progressively replacing each module
of the real-valued DETR baseline with a quantized one and comparing the average precision
(AP) drop on the VOC dataset [62], as shown in Fig. 2.11. We find that quantizing the MHA